library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
library(dplyr)
library(imputeTS)
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday dog breed data
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
# Tidy Tuesday data for challenge problem
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instruction (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).
Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.
totalGardenHarvest <- garden_harvest %>%
  mutate(wt_lbs = weight * 0.00220462) %>%
  mutate(weekday = wday(
    ymd(date),
    label = TRUE,
    abbr = TRUE,
    week_start = 7
  )) %>%
  group_by(vegetable, weekday) %>%
  summarise(weekday_wt_lbs = sum(wt_lbs)) %>%
  mutate(total_wt_lbs = sum(weekday_wt_lbs)) %>%
  pivot_wider(names_from = weekday, values_from = weekday_wt_lbs) %>%
  select("vegetable",
         "Mon",
         "Tue",
         "Wed",
         "Thu",
         "Fri",
         "Sat",
         "Sun",
         "total_wt_lbs")
totalGardenHarvest
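As a small variation on the pivot above, the sketch below uses made-up harvest records (not the real garden_harvest data) to show how `values_fill = 0` in `pivot_wider()` turns days with no harvest into 0 instead of NA. The name `toy_harvest` and its values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Made-up harvest records (weights in grams); not the real garden_harvest data
toy_harvest <- tibble(
  vegetable = c("beets", "beets", "kale", "kale"),
  date = ymd(c("2020-07-06", "2020-07-07", "2020-07-06", "2020-07-13")),
  weight = c(100, 200, 50, 150)
)

toy_wide <- toy_harvest %>%
  mutate(weekday = wday(date, label = TRUE, abbr = TRUE)) %>%
  group_by(vegetable, weekday) %>%
  summarise(wt_lbs = sum(weight) * 0.00220462, .groups = "drop") %>%
  pivot_wider(
    names_from = weekday,
    values_from = wt_lbs,
    values_fill = 0 # days with no harvest show 0 rather than NA
  )

toy_wide
```

Whether 0 or NA is the better display depends on whether "nothing harvested" and "no record" should be told apart.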
garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?
varGardenHarvest <- garden_harvest %>%
  mutate(wt_lbs = weight * 0.00220462) %>%
  group_by(vegetable, variety) %>%
  summarise(to_varwt_lbs = sum(wt_lbs)) %>%
  left_join(garden_planting,
            by = c("vegetable", "variety"))
Some vegetable varieties were planted in more than one plot, so the join returns one row per plot and repeats the same total harvest weight in each of them. Varieties that were planted on multiple dates cause the same duplication. It is hard to fix this fully at this point because we haven’t figured out which plots the weights belong to. However, one possible fix is to collapse garden_planting to a single row per variety before joining, listing all of its plots together and summing the seeds planted across plots.
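The fix described above could be sketched roughly as follows. The tables here are made-up stand-ins, and the column names `plot` and `number_seeds_planted` are assumed to match garden_planting; check them against the real data before reusing this.

```r
library(dplyr)

# Made-up stand-ins for the harvest totals and for garden_planting
toy_totals <- tibble(
  vegetable = c("tomatoes", "lettuce"),
  variety = c("Amish Paste", "Farmer's Market Blend"),
  tot_wt_lbs = c(6.6, 3.8)
)
toy_planting <- tibble(
  vegetable = c("tomatoes", "tomatoes", "lettuce"),
  variety = c("Amish Paste", "Amish Paste", "Farmer's Market Blend"),
  plot = c("J", "N", "C"),
  number_seeds_planted = c(1, 2, 60)
)

# Collapse the planting records to one row per variety before joining
planting_one_row <- toy_planting %>%
  group_by(vegetable, variety) %>%
  summarise(
    plots = paste(sort(unique(plot)), collapse = ", "),
    seeds_planted = sum(number_seeds_planted),
    .groups = "drop"
  )

# Each variety now matches exactly one planting row
toy_joined <- toy_totals %>%
  left_join(planting_one_row, by = c("vegetable", "variety"))

toy_joined
```

This keeps one row per variety, at the cost of no longer knowing how much of the harvest came from each plot.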
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
1. In garden_harvest, group_by(vegetable, variety) and summarize the weight to get the total harvest weight of each variety of each vegetable.
2. Join the two datasets.
3. Join the data from the grocery store to get the corresponding price for each vegetable.
4. Compute the gross saving for each variety by multiplying its harvest weight by the grocery store price, making sure the weight is in the same unit as the price.
5. Subtract the corresponding spending from the gross saving to get the actual saving for each variety.
6. Sum the actual savings to get the overall saving.
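A minimal sketch of those steps on made-up numbers is below. The grocery price table (`toy_prices`, `price_per_lb`) is hypothetical, and `price_with_tax` is assumed to be the spending column; neither the values nor the choice of join keys come from the real data.

```r
library(dplyr)

# Made-up stand-ins; weights are in grams, prices in dollars
toy_harvest <- tibble(
  vegetable = c("tomatoes", "tomatoes", "kale"),
  variety = c("Big Beef", "Big Beef", "Heirloom Lacinto"),
  weight = c(1000, 500, 400)
)
toy_spending <- tibble(
  vegetable = c("tomatoes", "kale"),
  variety = c("Big Beef", "Heirloom Lacinto"),
  price_with_tax = c(3.5, 3.0)
)
# Hypothetical grocery store prices per pound
toy_prices <- tibble(
  vegetable = c("tomatoes", "kale"),
  price_per_lb = c(2, 4)
)

toy_savings <- toy_harvest %>%
  group_by(vegetable, variety) %>%
  summarise(wt_lbs = sum(weight) * 0.00220462, .groups = "drop") %>% # step 1
  left_join(toy_spending, by = c("vegetable", "variety")) %>%       # step 2
  left_join(toy_prices, by = "vegetable") %>%                       # step 3
  mutate(
    gross_saving = wt_lbs * price_per_lb,                           # step 4
    net_saving = gross_saving - price_with_tax                      # step 5
  )

total_saving <- sum(toy_savings$net_saving)                         # step 6
total_saving
```

With the real data, unmatched varieties would produce NA prices after the joins, so they would need to be handled before summing.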
tGardenHarvest <- garden_harvest %>%
  filter(vegetable == "tomatoes") %>%
  mutate(wt_lbs = weight * 0.00220462) %>%
  group_by(variety) %>%
  # Sum the weight in pounds so the totals match the plot title;
  # min(date) is the earliest date recorded in garden_harvest for the variety
  summarise(total_weight = sum(wt_lbs),
            first_plant_date = min(date)) %>%
  arrange(ymd(first_plant_date))
tGardenHarvest %>%
  ggplot(aes(x = total_weight, y = fct_reorder(variety, first_plant_date))) +
  labs(title = "Harvest in Pounds for Each Variety of Tomatoes", x = NULL, y = "Variety") +
  geom_bar(color = "white", fill = "deepskyblue4", stat = "identity")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
orderGardenHarvest <- garden_harvest %>%
  mutate(low_case_var = str_to_lower(variety)) %>%
  mutate(length_var = str_length(variety)) %>%
  arrange(vegetable, length_var) %>%
  distinct(vegetable, low_case_var, .keep_all = TRUE) %>%
  select(vegetable, low_case_var, length_var)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
filGardenHarvest <- garden_harvest %>%
  filter(str_detect(variety, "er|ar")) %>%
  distinct(vegetable, variety)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.
One of the vans used to redistribute bicycles to different stations.
Two data tables are available:
Trips contains records of individual rentals
Stations gives the locations of the bike rental stations
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
  "https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data-Small.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
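One way to switch to the full file without retyping the address is sketched below. The names `data_site_small` and `data_site_full` are just illustrative, and the read call is left commented out so nothing is downloaded here.

```r
library(stringr)

data_site_small <-
  "https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data-Small.rds"

# Drop the "-Small" suffix to point at the full quarterly file
data_site_full <- str_remove(data_site_small, "-Small")
data_site_full

# Same read call as for the small file; not run here
# Trips <- readRDS(gzcon(url(data_site_full)))
```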
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().
Trips %>%
  ggplot() +
  geom_density(
    aes(x = sdate),
    fill = "deepskyblue4",
    color = "#e9ecef",
    alpha = 0.9,
    adjust = 0.5
  ) +
  labs(title = "Distribution of Rental Start Time (Date) for Events", x = NULL, y = NULL)
> The density plot shows a multimodal, right-skewed distribution of rental start times. It covers rentals that started between October and January. Because the distribution is right-skewed, it seems that more rentals started between October and December than between December and January.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
hMTrip <- Trips %>%
  mutate(times = hour(sdate) + minute(sdate) / 60)
hMTrip %>%
  ggplot() +
  geom_density(
    aes(x = times),
    fill = "deepskyblue4",
    color = "#e9ecef"
  ) +
  labs(title = "Distribution of Rental Start Time for Events", x = "Rental Start Time (Hour)", y = NULL)
> This density plot shows a multimodal, slightly left-skewed distribution of rental start times within a day. Since there are only 24 hours in a day, it covers 0:00 to 24:00. The graph shows that rentals relatively rarely start between 0:00 and 5:00.
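To check the hour-plus-fraction conversion from the hint, here is a quick sketch on three made-up timestamps (the vector `toy_times` is an assumption, not part of the Trips data).

```r
library(lubridate)

# Made-up timestamps: 3:30, 3:45, and 17:15
toy_times <- ymd_hm(c("2014-10-01 03:30", "2014-10-01 03:45", "2014-10-01 17:15"))

# A minute is 1/60 of an hour
decimal_hour <- hour(toy_times) + minute(toy_times) / 60
decimal_hour
# 3.50 3.75 17.25
```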
wTrip <- Trips %>%
  mutate(weekday = wday(sdate,
                        label = TRUE,
                        abbr = TRUE))
wTrip %>%
  ggplot() +
  geom_bar(aes(y = fct_rev(weekday)), color = "white", fill = "deepskyblue4") +
  labs(title = "Number of Events on Each Day of the Week", x = NULL, y = NULL)
> According to the bar plot, the most trips occurred on Friday (about 1500) and the fewest on Sunday (about 1250). However, the number of trips does not differ much across the days of the week, even between Friday and Sunday.
fTrip <- hMTrip %>%
  mutate(weekday = wday(sdate,
                        label = TRUE,
                        abbr = TRUE))
fTrip %>%
  ggplot() +
  geom_density(
    aes(x = times),
    fill = "deepskyblue4",
    color = "#e9ecef"
  ) +
  labs(title = "Distribution of Rental Start Time for Events", x = "Rental Start Time (Hour)", y = NULL) +
  facet_wrap(~ weekday) +
  theme_minimal()
> Yes. The patterns for Saturday and Sunday are very similar: both are left-skewed and bimodal. By contrast, the patterns for weekdays are multimodal and very slightly left-skewed.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
fTrip %>%
  ggplot() +
  geom_density(
    aes(x = times, fill = client),
    color = NA,
    alpha = 0.5
  ) +
  labs(title = "Distribution of Rental Start Time for Events", x = "Rental Start Time (Hour)", y = NULL) +
  facet_wrap(~ weekday)
> The rental behavior of the two groups of clients is different. Registered users are consistent with the pattern in problem 11: the weekend patterns differ from the weekday patterns, which are multimodal and very slightly left-skewed. However, for those who haven’t joined the bike-rental organization, the rental behavior is very similar on every day of the week, showing a unimodal pattern.
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
fTrip %>%
  ggplot() +
  geom_density(
    aes(x = times, fill = client),
    color = NA,
    alpha = 0.5,
    position = position_stack()
  ) +
  labs(title = "Distribution of Rental Start Time for Events", x = "Rental Start Time (Hour)", y = NULL) +
  facet_wrap(~ weekday)
> In my opinion, it depends on the situation. The graph in problem 11 is better for comparing the distributions of start times between the two groups, while this one is more appropriate for looking at the overall distribution and comparing the proportions of the two groups at a particular time.
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
wT <- fTrip %>%
  mutate(weekend = ifelse(weekday %in% c("Sat", "Sun"), "weekend", "weekday"))
wT %>%
  ggplot() +
  geom_density(
    aes(x = times, fill = client),
    color = NA,
    alpha = 0.5
  ) +
  labs(title = "Distribution of Rental Start Time for Events", x = "Rental Start Time (Hour)", y = NULL) +
  facet_wrap(~ weekend)
> This is consistent with the graphs from the previous questions: registered clients show a multimodal pattern on weekdays and a bimodal pattern on weekends, while casual clients show a unimodal pattern on both weekdays and weekends.
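The weekend flag can also be built from the numeric day of the week, which avoids depending on the locale's day labels such as "Sat" and "Sun". This is only a sketch on made-up timestamps (`toy_trips` is an assumption), using `week_start = 1` so that Saturday is 6 and Sunday is 7.

```r
library(dplyr)
library(lubridate)

# Made-up rental start times: a Friday, Saturday, Sunday, and Monday
toy_trips <- tibble(
  sdate = ymd_hm(c(
    "2014-10-03 08:10", "2014-10-04 13:00",
    "2014-10-05 15:30", "2014-10-06 17:45"
  ))
)

toy_trips <- toy_trips %>%
  mutate(
    # With week_start = 1, Monday is 1, Saturday is 6, and Sunday is 7
    weekend = ifelse(wday(sdate, week_start = 1) >= 6, "weekend", "weekday")
  )

toy_trips
```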
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
wT %>%
  ggplot() +
  geom_density(
    aes(x = times, fill = weekend),
    color = NA,
    alpha = 0.5
  ) +
  labs(title = "Distribution of Rental Start Time for Events", x = "Rental Start Time (Hour)", y = NULL) +
  facet_wrap(~ client)
> This graph is faceted by client, so each panel shows the rental behavior of one group on weekdays and weekends. Casual clients have a similar pattern on weekdays and weekends, centered around noon, while registered clients behave differently on weekdays and weekends: their rentals are centered around commuting hours on weekdays and around noon on weekends. This graph is better for comparing behavior within one group. In contrast, the previous graph is better for comparing rental behavior between the two groups on weekdays and on weekends. Neither graph is better overall; it depends on the situation, since the two graphs answer different questions.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
staTrip <- Trips %>%
  left_join(Stations, by = c("sstation" = "name")) %>%
  group_by(lat, long) %>%
  summarise(EventsCount = n())
staTrip %>%
  ggplot(aes(x = long, y = lat, color = EventsCount)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Total Number of Departures From Each Station", x = "Longitude", y = "Latitude")
> Most of the stations are clustered between longitudes -77.1 and -77.0. There are several outliers in the upper left corner of the graph, and most stations had fewer than 50 departures.
staTrip2 <- Trips %>%
  left_join(Stations, by = c("sstation" = "name")) %>%
  group_by(lat, long) %>%
  # Share of each station's departures made by casual riders;
  # stations with no casual departures get 0
  summarise(Casual = mean(client == "Casual"), .groups = "drop")
staTrip2 %>%
  ggplot(aes(x = long, y = lat, color = Casual)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Proportion of Casual Clients' Departures From Each Station", x = "Longitude", y = "Latitude")
> There are several stations where casual clients make up a very large proportion of departures (almost 100%). In addition, several stations near latitude 38.9 have a relatively large proportion (between 0.5 and 1) of casual riders. The remaining stations have only a very small proportion of casual clients.
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
In this section, we’ll use the data from 2022-02-01 Tidy Tuesday. If you didn’t use that data or need a little refresher on it, see the website.
breed_traits dataset on the x-axis, with a dot for each rating. First, create a new dataset called breed_traits_total that has two variables – Breed and total_rating. The total_rating variable is the sum of the numeric ratings in the breed_traits dataset (we’ll use this dataset again in the next problem). Then, create the graph just described. Omit Breeds with a total_rating of 0 and order the Breeds from highest to lowest ranked. You may want to adjust the fig.height and fig.width arguments inside the code chunk options (e.g. {r, fig.height=8, fig.width=4}) so you can see things more clearly - check this after you knit the file to ensure it looks like what you expected.
breed_traits_total <- breed_traits %>%
  select(-c("Coat Type", "Coat Length")) %>%
  pivot_longer(!Breed, names_to = "level", values_to = "rating") %>%
  group_by(Breed) %>%
  summarise(total_rating = sum(rating))
breed_traits_total %>%
  filter(total_rating != 0) %>%
  arrange(desc(total_rating)) %>%
  ggplot(aes(y = fct_reorder(Breed, total_rating), x = total_rating)) +
  geom_point() +
  labs(title = "Breeds Ranked By Their Ratings", x = "Rating", y = "Breeds") +
  theme(
    plot.title = element_text(hjust = 0.1, face = "bold", size = 15),
    axis.text.y = element_text(
      face = "bold",
      hjust = 1,
      vjust = 1
    ),
    panel.spacing.x = unit(0.75, "cm")
  )
breed_rank_all dataset). The points within each breed will be connected by a line, and the breeds should be arranged from the highest median rank to lowest median rank (“highest” is actually the smallest number, e.g. 1 = best). After you’re finished, think of AT LEAST one thing you could do to make this graph better. HINTS: 1. Start with the breed_rank_all dataset and pivot it so year is a variable. 2. Use the separate() function to get year alone, and there’s an extra argument in that function that can make it numeric. 3. For both datasets used, you’ll need to str_squish() Breed before joining.
newb_breed_traits_total <- breed_traits_total %>%
  mutate(breedR = str_squish(Breed)) %>%
  slice_max(n = 20, order_by = total_rating)
new_breed_rank_all <- breed_rank_all %>%
  pivot_longer(cols = `2013 Rank`:`2020 Rank`,
               names_to = "year",
               values_to = "rank") %>%
  separate(
    col = year,
    # Keep the year and drop the trailing "Rank" piece
    into = c("years", NA),
    sep = " ",
    convert = TRUE
  ) %>%
  mutate(breedR = str_squish(Breed)) %>%
  inner_join(newb_breed_traits_total,
             by = "breedR")
new_breed_rank_all %>%
  ggplot(aes(y = fct_rev(fct_reorder(breedR, rank, median)), x = years)) +
  geom_point(aes(color = rank)) +
  geom_line() +
  labs(title = "Breeds Ranked By Years", x = NULL, y = "Breeds")
> Assign a different dot size to each rank so that the rank is more explicit and easier for the viewer to locate. For example, use the largest dot for the highest rank and the second largest for the second highest.
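The three hints for this exercise (pivot so year is a variable, separate() with conversion, and str_squish() on Breed) can be tried on a tiny made-up table first. `toy_rank` and its values are assumptions, shaped loosely like breed_rank_all.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up table with the same kind of columns as breed_rank_all;
# note the doubled space in the first breed name
toy_rank <- tibble(
  Breed = c("Retrievers  (Labrador)", "Bulldogs"),
  `2019 Rank` = c(1, 5),
  `2020 Rank` = c(1, 5)
)

toy_long <- toy_rank %>%
  pivot_longer(
    cols = ends_with("Rank"),
    names_to = "year",
    values_to = "rank"
  ) %>%
  # NA in `into` drops the "Rank" piece; convert = TRUE makes year numeric
  separate(col = year, into = c("year", NA), sep = " ", convert = TRUE) %>%
  # Collapse repeated whitespace so breed names match across tables
  mutate(Breed = str_squish(Breed))

toy_long
```

Once the year column is numeric, it can be mapped directly to an axis or a color scale.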
join or pivot function (or both, if you’d like), a str_XXX() function, and a fct_XXX() function to create a graph using any of the dog datasets. One suggestion is to try to improve the graph you created for the Tidy Tuesday assignment. If you want an extra challenge, find a way to use the dog images in the breed_rank_all file - check out the ggimage library and this resource for putting images as labels.
aveBreed_rank_all <- breed_rank_all %>%
  pivot_longer(cols = `2013 Rank`:`2020 Rank`,
               names_to = "year",
               values_to = "rank") %>%
  group_by(Breed) %>%
  # Average rank over the eight years 2013 to 2020
  summarise(ave_rank = mean(rank)) %>%
  mutate(newBreed = str_to_title(Breed)) %>%
  slice_min(n = 20, order_by = ave_rank)
aveBreed_rank_all %>%
  ggplot(aes(x = ave_rank, y = fct_rev(fct_reorder(newBreed, ave_rank)))) +
  labs(title = "Average Rank of Breeds Between 2013 and 2020", x = "Rank", y = "Breed") +
  geom_bar(color = "white", fill = "deepskyblue4", stat = "identity")
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?